ABSTRACT

Learning in multiagent systems can be slow because agents 
must learn both how to behave in a complex environment 
and how to account for the actions of other agents. The 
inability of an agent to distinguish between the true en- 
vironmental dynamics and those caused by the stochastic 
exploratory actions of other agents creates noise in each 
agent’s reward signal. This learning noise can have unfore- 
seen and often undesirable effects on the resultant system 
performance. We define such noise as exploratory action 
noise, demonstrate the critical impact it can have on the 
learning process in multiagent settings, and introduce a re- 
ward structure to effectively remove such noise from each 
agent’s reward signal. In particular, we introduce Coordi- 
nated Learning without Exploratory Action Noise (CLEAN) 
rewards and empirically demonstrate their benefits. 
1. INTRODUCTION

In multiagent systems, agents provide a constantly chang- 
ing background in which each agent needs to learn its task 
[3, 4, 8]. As a consequence, agents need to extract the un- 
derlying reward signal from the noise of other agents act- 
ing within the environment. Issues arise when agents are 
treated as a part of the environment, and their exploratory 
actions are seen by other agents as stochastic environmental 
dynamics. The inability of agents to distinguish the true 
environmental dynamics from those caused by the stochas- 
tic exploratory actions of other agents creates noise on each 
agent’s reward signal. This problem cannot simply be ad- 
dressed by turning off exploration and acting greedily (this 
has been repeatedly shown to result in poor performance as
agents always exploit their current knowledge which is fre- 
quently incomplete or inaccurate). Additionally, methods 
of slowly turning down exploration (e.g., annealing) or
intelligently modifying exploration (e.g., Win or Learn Fast,
WoLF [2]) fail to fully address this issue. This is because
these techniques still rely upon agents taking explicit ex- 
ploratory actions within the environment. CLEAN rewards 
address this issue using implicit exploration via counterfac- 
tual action based reward shaping techniques, such that all 
explicit exploration is removed from the learning process. 

2. EXPLORATORY ACTION NOISE 

Agents often treat other agents as part of the environment,
so the exploratory actions of other agents become stochastic
environmental noise [5, 6, 7, 10]. Agents are then
unable to distinguish whether their peers are taking purposeful
actions or exploring. This may cause agents to bias
their policies such that they actually depend upon the ex- 
ploratory actions of other agents to perform well. Agents 
learning in the presence of exploration may not be learning 
optimal policies (Figure 1a) because agents cannot distinguish
between true environmental dynamics and dynamics
caused by the exploratory actions of other agents. Here, 
agents (the solution) actually become part of the problem 
(adding stochastic noise to the environment). This holds for 
both off-line and on-line learning methods. 

3. CLEAN REWARDS 

Coordinated Learning without Exploratory Action Noise 
(CLEAN) rewards address the structural credit assignment 
problem and issues arising from learning noise caused by 
exploration to promote learning, coordination, and scalabil- 
ity. CLEAN rewards separate explicit from implicit explo- 
ration. Agents behave greedily outwardly (explicitly) and 
explore internally (implicitly) via counterfactual exploratory 
actions. Agents use Equation 1 to perform counterfactual
reward calculations:

$$C_{i,t}(a) = G(a - a_i + a'_i) - G(a) \qquad (1)$$

Here $a$ is the system action vector and $a_i$ is agent $i$'s action.
This gives the agent a reward that represents how the system
would have performed had the agent not followed its best policy,
but instead had taken some counterfactual action, $a'_i$, in place of $a_i$.
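The counterfactual calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the counterfactual joint action is formed by replacing agent i's entry in the action vector with the counterfactual action, and the names `clean_reward` and `toy_objective`, as well as the toy objective itself, are hypothetical.

```python
from typing import Callable, List, Sequence


def clean_reward(
    G: Callable[[Sequence[float]], float],
    actions: Sequence[float],
    i: int,
    counterfactual_action: float,
) -> float:
    """Return G(a with a_i replaced by a'_i) - G(a) for agent i."""
    counterfactual: List[float] = list(actions)  # copy, so `actions` is not mutated
    counterfactual[i] = counterfactual_action    # swap in a'_i for a_i
    return G(counterfactual) - G(actions)


def toy_objective(actions: Sequence[float]) -> float:
    """Hypothetical objective: penalize distance of the action sum from 10."""
    return -abs(sum(actions) - 10)


joint_action = [3, 3, 3]  # greedy actions actually executed (sum = 9)
better = clean_reward(toy_objective, joint_action, 0, 4)  # sum would be 10
worse = clean_reward(toy_objective, joint_action, 0, 1)   # sum would be 7
```

A positive value means the counterfactual action would have improved the system objective relative to the greedy joint action, and a negative value means it would have made it worse. The counterfactual action is never executed in the environment.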
CLEAN rewards use implicit counterfactual exploration to 
eliminate explicit exploratory action noise within the environment.

Figure 1: Gaussian Squeeze Domain (GSD) results: (a) When agents stop exploring in the GSD (episode 1000), system
performance decreases due to exploratory action noise; (b) CLEAN rewards outperform global and difference rewards; (c)
CLEAN rewards maintain superior performance as scaling increases.

The Gaussian Squeeze Domain. A set of agents
attempts to learn to optimize the capacity of the following
system objective:

$$G(x) = x e^{-(x - \mu)^2 / \sigma^2} \qquad (2)$$

where $x$ is the sum of the actions of agents ($x = \sum_i a_i$),
$\mu$ is the mean, and $\sigma$ is the standard deviation of the system
objective. Here, $\mu$ and $\sigma$ define the target capacity, $x$, that
the agents must coordinate their actions to achieve.
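As a concrete illustration, the system objective can be written as a short Python function. This sketch assumes the objective has the form G(x) = x exp(-(x - mu)^2 / sigma^2), with x the sum of the agents' actions. The function name `gaussian_squeeze` and the example action values are hypothetical.

```python
import math
from typing import Sequence


def gaussian_squeeze(actions: Sequence[float], mu: float, sigma: float) -> float:
    """System objective G(x) = x * exp(-(x - mu)^2 / sigma^2), with x = sum of actions."""
    x = sum(actions)
    return x * math.exp(-((x - mu) ** 2) / sigma ** 2)


# Two agents whose actions sum exactly to mu: the exponential factor is 1.
on_target = gaussian_squeeze([100, 75], mu=175.0, sigma=175.0)

# The same agents doubling their actions: the sum overshoots mu by 175.
overshoot = gaussian_squeeze([200, 150], mu=175.0, sigma=175.0)
```

When the summed action equals the mean, the exponential term equals 1 and the objective equals the sum itself. As the sum moves away from the mean, the exponential term shrinks at a rate set by the standard deviation.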

4. RESULTS 

There were four types of experiments: random agents
(baseline) and three types of Q-learning agents: global (G),
difference (D) [1], and CLEAN ($C_{i,t}$). Figure 1b shows the
results for 1000 agents learning in the Gaussian Squeeze Domain
with $\mu = 175$, $\sigma = 175$. The performance of agents
using global rewards G is poor in both the online and the 
offline settings because global rewards do not provide indi- 
vidual agents with specific feedback on how their individ- 
ual actions impacted the system performance compared to 
the actions of all of the other agents in the system (i.e., 
each agent’s reward signal gets lost in the “noise” of the rest 
of the system). Agents using difference rewards D outper- 
formed agents using G because difference rewards provide 
each agent with a reward that is reflective of its own individual
impact on the system performance. Unfortunately,
difference rewards do not address the issues associated with 
exploratory action noise. The disparity in performance be- 
tween CLEAN rewards and difference rewards can be di- 
rectly attributed to the impact of exploratory action noise 
on the learning process. As seen, exploratory action noise 
can have a massive impact on learning performance, espe- 
cially in large tightly coupled multiagent systems. Agents 
using CLEAN rewards all converge to nearly optimal per- 
formance, maintaining 5 times the performance of the next 
best technique (i.e., D) with scaling up to 1000 agents. 
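To make the CLEAN learning setup more concrete, the following is a rough sketch of stateless value-learning agents in a small Gaussian Squeeze instance. It is not the experimental code behind the reported results. The integer action set, the uniform random choice of counterfactual action, the learning rate, the small problem size, and the names `gsd` and `train_clean` are all assumptions made for illustration.

```python
import math
import random
from typing import List, Tuple


def gsd(x: float, mu: float, sigma: float) -> float:
    """Assumed objective form: G(x) = x * exp(-(x - mu)^2 / sigma^2)."""
    return x * math.exp(-((x - mu) ** 2) / sigma ** 2)


def train_clean(
    n_agents: int = 10,
    n_actions: int = 5,
    mu: float = 20.0,
    sigma: float = 20.0,
    episodes: int = 200,
    alpha: float = 0.1,
    seed: int = 0,
) -> Tuple[List[List[float]], List[float]]:
    rng = random.Random(seed)
    # One stateless value table per agent: q[i][k] estimates the value of action k.
    q = [[0.0] * n_actions for _ in range(n_agents)]
    history = []
    for _ in range(episodes):
        # Explicit behaviour is purely greedy, so no exploratory action
        # ever reaches the environment (ties resolve to the lowest action).
        greedy = [max(range(n_actions), key=q_i.__getitem__) for q_i in q]
        x = sum(greedy)
        g = gsd(x, mu, sigma)
        history.append(g)
        for i in range(n_agents):
            # Implicit exploration: score a randomly chosen counterfactual
            # action without executing it, then update that action's value.
            cf = rng.randrange(n_actions)
            c = gsd(x - greedy[i] + cf, mu, sigma) - g
            q[i][cf] += alpha * (c - q[i][cf])
    return q, history


q_tables, performance = train_clean(episodes=50, seed=1)
```

Here `performance` records the system objective of the greedy joint action in each episode, and `q_tables` holds each agent's learned values. Because the environment only ever sees greedy actions, every agent's counterfactual reward is computed against a joint action that contains no exploratory moves by its peers.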

The GSD experiment in Figure 1c considers how performance
changes as complexity increases. Figure 1c shows the
results of scaling the number of agents with a fixed mean
and standard deviation ($\mu = 175$ and $\sigma = 175$). CLEAN rewards are
more robust to scaling (increased congestion) than G and D 
because agents receive a cleaner learning signal. 

5. DISCUSSION AND CONCLUSION 

The exploration-exploitation tradeoff has been studied
extensively within the multiagent learning literature.
However, relatively little work has been done to directly
address the impact of learning noise caused by the
exploratory actions of agents. We first showed the potential
impact of exploratory action noise on learning, demonstrat- 
ing that exploratory actions can cause agents to bias their 
policies to depend upon the exploratory actions of others, 
which can lead to suboptimal learning. We then introduced 
CLEAN rewards, which are shaped rewards that promote coordination
and scalability in multiagent systems by removing
exploratory action noise from each agent's reward signal.